


Appendix Potential Negative Societal Impacts

Neural Information Processing Systems

C.3 Other Differences Beyond the points discussed above, our work differs from Daniely [12] in two further ways. First, they analyze SGD, whereas we analyze a constrained optimization problem solved with projected SGD. This may be the reason why we obtain a stronger bound on the width. In the experiments in Section 5, we observe that SGD performs badly when the width is small (see the leftmost column of Figure 4(b)). We therefore suspect that an algorithmic change is needed to train nets this narrow (because of the training difficulty), and we indeed propose a new method for training narrow nets. Second, they consider binary datasets with labels in {+1, -1}, while our results apply to arbitrary labels. In addition, their proof seems to depend heavily on the labels being {+1, -1}, and seems hard to generalize to general labels.


When Expressivity Meets Trainability: Fewer than n Neurons Can Work

Neural Information Processing Systems

Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape?





5aea56eefab60e06f35016478e21aae6-Supplemental-Conference.pdf

Neural Information Processing Systems

A.2 Derivations for Section 3.1 We begin with a formal derivation of the formulas in Section 3.1. Recall that we consider a function $F(\theta)$ whose parameters can be split into $n$ SI groups: $\theta = (\theta_1, \dots, \theta_n)$. We solve the optimization problem (1) with projected gradient descent (2). Remark 2 The above formulation allegedly lacks the third (divergent) regime. If, conversely, $\eta > 1 / \sum_{i=1}^{n} \alpha_i$, then at each iteration at least one of the individual ELRs exceeds its convergence threshold: $\eta_i > 1 / \alpha_i$.
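As a rough illustration of projected gradient descent over several scale-invariant parameter groups, the following sketch takes a gradient step for each group and then projects it back onto the unit sphere. It is not the paper's problem (1) or update (2), which this excerpt does not reproduce. The per-group Rayleigh quotient objective, the unit-sphere constraint, and the names `rayleigh`, `rayleigh_grad`, and `projected_gd` are assumptions chosen only because the objective is scale-invariant in each group and its minimizer is known.

```python
import numpy as np


def rayleigh(theta, A):
    """Scale-invariant stand-in objective: theta^T A theta / theta^T theta."""
    return float(theta @ A @ theta) / float(theta @ theta)


def rayleigh_grad(theta, A):
    """Gradient of the Rayleigh quotient with respect to theta (A symmetric)."""
    sq_norm = float(theta @ theta)
    return 2.0 * (A @ theta - rayleigh(theta, A) * theta) / sq_norm


def projected_gd(groups, mats, lr, steps):
    """Projected gradient descent with every group kept on its unit sphere.

    groups: list of 1-D arrays, one per (hypothetical) scale-invariant group
    mats:   list of symmetric matrices defining each group's objective term
    """
    # Start from the projection of the initial point onto the spheres.
    groups = [g / np.linalg.norm(g) for g in groups]
    for _ in range(steps):
        updated = []
        for theta, A in zip(groups, mats):
            step = theta - lr * rayleigh_grad(theta, A)   # plain gradient step
            updated.append(step / np.linalg.norm(step))   # project to the sphere
        groups = updated
    return groups


if __name__ == "__main__":
    mats = [np.diag([1.0, 2.0, 3.0]), np.diag([5.0, 4.0])]
    init = [np.ones(3), np.ones(2)]
    final = projected_gd(init, mats, lr=0.1, steps=500)
    for theta, A in zip(final, mats):
        print(np.round(theta, 4), round(rayleigh(theta, A), 6))
```

With this stand-in objective, each group should drift toward the eigenvector of its matrix with the smallest eigenvalue, which makes the effect of the projection step easy to check by hand.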



A Training Regime

Neural Information Processing Systems

For the Spectral Mixture Kernel, we use 4 mixtures. The CNF component for our model was inspired by FFJORD. For NGGP, we use the same CNF component architecture as for the sines dataset. Adding noise allows for better performance when learning with the CNF component. We also use the same CNF component architecture as for the sines dataset. For this dataset, we tested NGGP and DKT models with RBF and Spectral kernels only.
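For readers unfamiliar with the Spectral Mixture Kernel, the sketch below evaluates its standard one-dimensional form with 4 mixture components, k(tau) = sum_q w_q exp(-2 pi^2 tau^2 v_q) cos(2 pi tau mu_q). It is not the authors' implementation. The function name `spectral_mixture_kernel` and all weight, mean, and variance values are hypothetical placeholders; in practice these parameters would be learned.

```python
import numpy as np


def spectral_mixture_kernel(x1, x2, weights, means, variances):
    """One-dimensional spectral mixture kernel matrix between x1 and x2.

    weights, means, variances: 1-D arrays of length Q (one entry per mixture).
    """
    tau = x1[:, None] - x2[None, :]        # pairwise differences, shape (n, m)
    tau = tau[..., None]                   # shape (n, m, 1), broadcasts over Q
    envelope = np.exp(-2.0 * np.pi ** 2 * tau ** 2 * variances)
    oscillation = np.cos(2.0 * np.pi * tau * means)
    return (weights * envelope * oscillation).sum(axis=-1)


if __name__ == "__main__":
    # Hypothetical parameters for Q = 4 mixtures (illustrative values only).
    weights = np.array([0.4, 0.3, 0.2, 0.1])
    means = np.array([0.1, 0.5, 1.0, 2.0])
    variances = np.array([0.01, 0.05, 0.1, 0.2])
    x = np.linspace(-5.0, 5.0, 20)
    K = spectral_mixture_kernel(x, x, weights, means, variances)
    print(K.shape)
```

Each component corresponds to one Gaussian in the spectral density, so the number of mixtures controls how many distinct frequency bands the kernel can represent.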


Appendix Potential Negative Societal Impacts

Neural Information Processing Systems

In this paper, we discuss the expressivity and trainability of narrow neural networks. Appendix H introduces the following contents.